Visual analysis of incorrect predictions with images and model reasoning
Total Errors: 245
False Positives: 104 (model predicted merge; ground truth is no merge)
False Negatives: 141 (model predicted no merge; ground truth is merge)
Showing: First 20 examples
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257743 |
|---|---|
| Segment ID | 864691134580600413 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257848 |
|---|---|
| Segment ID | 864691135013908886 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |
| Operation ID | 1255508 |
|---|---|
| Segment ID | 864691136372236424 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |
| Operation ID | 1255508 |
|---|---|
| Segment ID | 864691135293484214 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1255508 |
|---|---|
| Segment ID | 864691135293484214 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1255508 |
|---|---|
| Segment ID | 864691135293484214 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1255278 |
|---|---|
| Segment ID | 864691135209755885 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |
| Operation ID | 1255278 |
|---|---|
| Segment ID | 864691135209755885 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |
| Operation ID | 1257735 |
|---|---|
| Segment ID | 864691135726099115 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257735 |
|---|---|
| Segment ID | 864691135726099115 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257735 |
|---|---|
| Segment ID | 864691135726099115 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | No Merge |
| Ground Truth | Should Merge |
| Operation ID | 1257735 |
|---|---|
| Segment ID | 864691135014342646 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |
| Operation ID | 1257735 |
|---|---|
| Segment ID | 864691135014342646 |
| Model | o4-mini |
| Prompt Mode | informative+heuristic1+heuristic2+heuristic3+heuristic4+heuristic5+heuristic6+heuristic7 |
| Model Prediction | Merge |
| Ground Truth | Should Not Merge |